Creating and Weighting Hunspell Dictionaries as Finite-State Automata

نویسندگان

  • Tommi A Pirinen
  • Krister Lindén
چکیده

There are numerous formats for writing spell-checkers for open-source systems and there are many lexical descriptions for natural languages written in these formats. In this paper, we demonstrate a method for converting Hunspell and related spell-checking lexicons into finite-state automata. We also present a simple way to apply unigram corpus training in order to improve the spellchecking suggestion mechanism using weighted finite-state technology. What we propose is a generic and efficient language-independent framework of weighted finite-state automata for spell-checking in typical open-source software, e.g. Mozilla Firefox, OpenOffice and the Gnome desktop.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Reduction of Computational Complexity in Finite State Automata Explosion of Networked System Diagnosis (RESEARCH NOTE)

This research puts forward rough finite state automata which have been represented by two variants of BDD called ROBDD and ZBDD. The proposed structures have been used in networked system diagnosis and can overcome cominatorial explosion. In implementation the CUDD - Colorado University Decision Diagrams package is used. A mathematical proof for claimed complexity are provided which shows ZBDD ...

متن کامل

Compiling Apertium morphological dictionaries with HFST and using them in HFST applications

In this paper we aim to improve interoperability and re-usability of the morphological dictionaries of Apertium machine translation system by formulating a generic finite-state compilation formula that is implemented in HFST finite-state system to compile Apertium dictionaries into general purpose finite-state automata. We demonstrate the use of the resulting automaton in FST-based spell-checki...

متن کامل

Finite automata for compact representation of tuple dictionaries

A generalization of the dictionary data structure is described, called tuple dictionary. A tuple dictionary represents the mapping of n-tuples of strings to some value. This data structure is motivated by practical applications in speech and language processing, in which very large instances of tuple dictionaries are used to represent language models. A technique for compact representation of t...

متن کامل

Applications of Finite Automata Representing Large Vocabularies Applications of Finite Automata Representing Large Vocabularies

The construction of minimal acyclic deterministic partial nite automata to represent large natural language vocabularies is described. Applications of such automata include: spelling checkers and advisers, multilanguage dictionaries, thesauri, minimal perfect hashing and text compression. Part of this research was supported by a grant awarded by the Brazilian National Council for Scienti c and ...

متن کامل

A Finite-State Library for NLP

A library of functions is described which use finite-state automata for compact storage and efficient usage of very large dictionaries and language models. The library can be used to test whether a word is in a dictionary, to perform morphological analysis, to construct perfect hash tables, and to construct and use very large language models (such as models which employ bigram and trigram frequ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011